Assessing clinical applicability of COVID-19 detection in chest radiography with deep learning

The coronavirus disease 2019 (COVID-19) pandemic has impacted healthcare systems across the world. Chest radiography (CXR) can be used as a complementary method for diagnosing/following COVID-19 patients. However, experience level and workload of technicians and radiologists may affect the decision process. Recent studies suggest that deep learning can be used to assess CXRs, providing an important second opinion for radiologists and technicians in the decision process, and super-human performance in detection of COVID-19 has been reported in multiple studies. In this study, the clinical applicability of deep learning systems for COVID-19 screening was assessed by testing the performance of deep learning systems for the detection of COVID-19. Specifically, four datasets were used: (1) a collection of multiple public datasets (284,793 CXRs); (2) the BIMCV dataset (16,631 CXRs); (3) COVIDGR (852 CXRs) and (4) a private dataset (6,361 CXRs). All datasets were collected retrospectively and consist of only frontal CXR views. A ResNet-18 was trained on each of the datasets for the detection of COVID-19. It is shown that a high dataset bias was present, leading to high performance in intradataset train-test scenarios (area under the curve > 0.98). However, significantly lower performances were obtained in interdataset train-test scenarios (area under the curve 0.55–0.84 for models trained on the collection of public datasets). A subset of the data was then assessed by radiologists for comparison to the automatic systems. Finetuning with radiologist annotations significantly increased performance across datasets (area under the curve 0.61–0.88) and improved the attention on clinical findings in positive COVID-19 CXRs. Nevertheless, tests on CXRs from different hospital services indicate that the screening performance of CXR and automatic systems is limited (area under the curve < 0.6 on emergency service CXRs). However, COVID-19 manifestations can be accurately detected when present, motivating the use of these tools for evaluating disease progression on mild to severe COVID-19 patients.

Contributions. The goals of this study are thus twofold: (i) to develop an automatic COVID-19 detection method in CXR to serve as a second opinion to support clinical decisions in the triage of COVID-19 patients; (ii) to assess the clinical applicability of deep learning systems for COVID-19 screening using CXR images. In particular, we contribute to the development of more robust COVID-19 automatic detection methods by:

• Critically comparing the intra- and inter-dataset performance, in both public and in-house datasets, of a deep learning model trained following methods similar to those proposed during the worldwide pandemic outbreak. In particular, we show that performance claims in the literature are overconfident due to dataset bias;

• Building on the previous point, showing that using annotations from medical experts can significantly mitigate dataset bias, allowing the model, based solely on radiological manifestations of the disease, to reach a COVID-19 screening performance similar to that of radiologists;

• Making the radiologists' annotations on the public datasets publicly available, to help the development and validation of future algorithms.

Figure 2 shows a summary of the study. Specifically, the performance of a ResNet-18 trained on different datasets is compared with the annotations of two experienced radiologists. It is also shown that introducing field knowledge during finetuning avoids the dataset bias inherent to previous solutions, improving both the system's performance and the significance of the explanations extracted from the model. Finally, the finetuned model is tested on an external set of images aimed at replicating a clinical environment.

Methods
Datasets. Three public datasets and one private dataset were used in this study. All data was collected retrospectively. Note that for the public datasets, the exact criteria for inclusion in the dataset and referral for CXR are unknown. For all datasets, the same processing and criteria were applied to ensure uniformity. Only frontal CXRs, postero-anterior (PA) and antero-posterior (AP), were included, and CXRs were divided into three classes: Normal, Pathological (not COVID-19) and COVID-19. Ground truth labels for all CXRs were obtained from the labels available in each dataset. The COVID-19 label corresponds to a positive SARS-CoV-2 RT-PCR result and not necessarily to the presence of radiological features of COVID-19. Table 1 shows the distribution of the number of images per dataset and class after exclusion of non-frontal CXRs. Table 2 shows the patient and CXR acquisition characteristics for each dataset after exclusion of non-frontal CXRs. This information was extracted from the metadata available for each dataset or from individual DICOM metadata. Note that patient and CXR acquisition characteristics are not available for every CXR. The reader is referred to the Additional Information section for details on dataset access.
Mixed dataset. The Mixed dataset is a combination of multiple public CXR datasets, similar to those used in most early publications on automatic COVID-19 detection in CXR 10,12,14 , combining pre-pandemic public datasets with recent COVID-19 positive CXRs, mostly extracted from academic articles and online publications.
Normal and Pathological CXRs were obtained from the CheXpert 18 , ChestXRay-8 19 and Radiological Society of North America Pneumonia Detection Challenge (RSNA-PDC) 20 datasets. COVID-19 positive cases and a residual amount of Normal and Pathological cases were extracted from online repositories of CXRs, namely the COVID-19 IDC 16 and COVIDx 10 datasets. Further CXRs were obtained by manual extraction of images published online on Twitter and the Sociedad Española de Radiologia Médica (SERAM) website. Finally, COVID-19 positive CXRs were obtained from the COVID DATA SAVE LIVES dataset, made available by HM Hospitales. Given that there is significant overlap between some of the datasets included in the Mixed dataset, repeated images were excluded.

CXR annotation. In order to evaluate the performance of radiologists in the detection of COVID-19 radiological features in CXR, manual annotation of a subset of CXR images from each dataset was performed by two radiologists using in-house software. The software presented CXRs from a randomly selected subset and allowed for window center/width adjustment, zooming and panning. Radiologists were asked to label CXRs into one of four classes: Normal, Not indicative of COVID-19 (pathological), Indicative of COVID-19 and Undetermined. The Indicative of COVID-19 class was defined as CXRs where the patient presented findings indicative of COVID-19, namely bilateral pulmonary opacities of low/medium density. The Undetermined class was defined as CXRs where the patient presented findings that could be indicative of COVID-19 but could also be indicative of another condition, namely unilateral lung opacities, diffuse bilateral opacities of ARDS pattern or diffuse reticular opacities. The Not indicative of COVID-19 (pathological) class was defined as CXRs where the patient presented findings indicative of any pathology other than COVID-19. CXRs where the patient presented medical devices were classified as Normal if the underlying pathology was not visible. Additionally, CXRs without sufficient quality for visual assessment by the radiologists, due to bad image quality, patient positioning or any other factors, could be labelled as Compromised for exclusion. Manual labelling of CXRs was performed in two stages. First, both radiologists independently classified each CXR. CXRs where the two radiologists disagreed were then selected for the second stage, where the two radiologists assessed the CXRs together to achieve consensus. At no point were the radiologists given access to the ground truth label, RT-PCR results or any other information besides the CXR image.
To ensure that written information present in the CXR image (such as hospital system, health service, laterality markers, patient positioning, etc.) did not bias the annotation, all written labels were blacked out before annotation. This was done in a semi-automatic way using a YOLOv3 24 architecture for the detection of written labels in CXRs. For this purpose, 317 CXRs were randomly selected from the Mixed dataset and bounding boxes were manually drawn around all written labels. The network was then trained on these 317 CXRs (733 bounding boxes). Prior to CXR annotation by the radiologists, all CXRs were visually inspected and any missing or incorrect bounding boxes were corrected manually. All annotations on the public datasets are available at https://doi.org/10.25747/342B-GF87.
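As an illustration of the blackout step, the following is a minimal sketch assuming the detector (or subsequent manual correction) has already produced bounding boxes in (x0, y0, x1, y1) pixel coordinates; the function name and box format are assumptions, not the paper's implementation.

```python
import numpy as np

def black_out_labels(image: np.ndarray, boxes) -> np.ndarray:
    """Black out rectangular regions containing written labels.

    `boxes` is assumed to be a list of (x0, y0, x1, y1) pixel coordinates,
    e.g. the YOLOv3 detections after manual correction.
    """
    out = image.copy()
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = 0  # set the region to black
    return out

cxr = np.random.randint(0, 4096, size=(512, 512), dtype=np.uint16)  # dummy CXR
clean = black_out_labels(cxr, [(10, 10, 120, 40)])
```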
Automatic CXR COVID-19 detection. For automatic COVID-19 detection in CXR, an 18-layer deep residual neural network architecture (ResNet18) 25 was used in all experiments. While larger networks such as DenseNet121 have shown excellent performance in CXR disease classification 18 , the smaller ResNet18 was chosen to try to reduce overfitting to the limited data in the COVID-19 class and the inherent bias of the existing datasets.
The ResNet18 architecture used was identical to that proposed in He et al. 25 , except that the input is a 1-channel grayscale image (512 × 512 pixels) and the number of output nodes equals the number of classes being considered.
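A minimal sketch of these two modifications, using torchvision's ResNet-18 (the paper's exact implementation may differ):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_cxr_resnet18(num_classes: int) -> nn.Module:
    """ResNet-18 adapted to 1-channel gray CXRs, as described in the text."""
    model = resnet18(weights=None)
    # Replace the 3-channel RGB stem with a single-channel convolution.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # One output node per class (Normal, Pathological, COVID-19).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# ResNet-18 is fully convolutional up to global average pooling,
# so 512 x 512 single-channel inputs are accepted as-is.
model = build_cxr_resnet18(num_classes=3)
x = torch.randn(1, 1, 512, 512)
print(model(x).shape)  # torch.Size([1, 3])
```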
The loss function used during training was the weighted binary cross-entropy 26 :

$$L = -\sum_{c=1}^{C} w_c \, y_c \log(p_c),$$

where C is the number of classes, y_c is one when c is the ground truth class (and zero otherwise) and p_c is the model prediction for class c. To balance the effect of different classes and avoid bias, class weights w_c = 1 − N_c/N were applied, where N is the total number of images in the training set and N_c the number of images of class c in the training set. Model optimization was performed using Adam 27 with a learning rate of 0.0001 and a batch size of 24. These values were chosen empirically based on previous experiments and hardware capacity. Given the large number of images in the Mixed dataset, an epoch was defined as 1200 batches (approximately one tenth of the dataset) and all models were trained for a maximum of 100 epochs with a patience of 10 epochs as an early stopping criterion.
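In PyTorch, this loss and optimizer setup could look as follows; the class counts below are placeholders, not the paper's actual numbers:

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Placeholder training-set counts (Normal, Pathological, COVID-19).
counts = np.array([120_000.0, 150_000.0, 10_000.0])
weights = torch.tensor(1.0 - counts / counts.sum(), dtype=torch.float32)  # w_c = 1 - N_c / N

# For one-hot targets, nn.CrossEntropyLoss with per-class weights computes the
# weighted term -w_c * y_c * log(p_c) per sample (the default reduction then
# averages by the sum of the weights in the batch).
criterion = nn.CrossEntropyLoss(weight=weights)

model = resnet18(num_classes=3)  # stand-in; see the architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr as in the text; batch size 24
```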
To estimate the performance of the predictive model in practice and avoid a possible bias due to random division of the data, a 5-fold cross-validation scheme was performed during training and testing. Folds were constructed through random CXR selection so that the class and dataset distributions in each fold are similar to the full dataset and so that all CXRs from each patient are placed in the same fold. In interdataset settings, i.e. when a model trained on dataset A is tested on dataset B, the full dataset B is used for testing.
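A sketch of a patient-wise stratified split with scikit-learn's StratifiedGroupKFold follows; the paper does not state which implementation was used, and stratifying jointly on class and dataset source would require a combined label (here only the class label is used for brevity).

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
n = 100
y = rng.integers(0, 3, size=n)             # class per CXR (0/1/2), synthetic
patient_ids = rng.integers(0, 40, size=n)  # several CXRs may share a patient
X = np.arange(n).reshape(-1, 1)            # stand-in for image references

# Keeps class proportions similar across folds while guaranteeing that all
# CXRs of a patient fall in the same fold, mirroring the protocol above.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=patient_ids)):
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```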

Experiments
CXR annotation. A total of 2,442 CXRs were selected for annotation by the two radiologists. Of these, 1,256 belong to the Mixed dataset, 289 to BIMCV, 300 to COVIDGR and 597 to CHVNGE (distribution per class shown in Table 1). Selection of CXRs for annotation was performed randomly: for the Mixed dataset, a balanced selection strategy was used during image selection, whereas for BIMCV, COVIDGR and CHVNGE, the dataset class distribution was maintained in the subset selected for annotation. For practical reasons, independent reading of CXRs by each radiologist was not possible in all cases: of the 2,442 CXRs annotated, 799 were annotated in consensus without independent reading.
During annotation, a total of 77 CXRs (of which 25 Normal) were randomly selected for repeated annotation to determine intraobserver variability. The repeated images were mixed in with new images in a 1:10 ratio during annotation. Radiologists were unaware that repeated images were being introduced, to avoid bias.

Model training. Baseline training. The network was first initialized with weights from an ImageNet pretrained model 28 (the weights in the first layer were taken from the red channel of the pretrained model). The model was first trained for binary Normal vs Not Normal classification and only then trained with three output nodes corresponding to the Normal, Pathological and COVID-19 classes, where the Pathological class included all classes except Normal and COVID-19. This two-step training strategy aims at leading the model to learn CXR-related features prior to learning COVID-19-related features, increasing feature relevance while reducing overfitting. This model was trained using the Mixed dataset to allow a direct comparison to previous studies on COVID-19 detection in CXR and will be referred to as M Mixed .
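A minimal sketch of the red-channel initialization, assuming the 1-channel model built in the earlier architecture snippet; this is one way to realize the description above, not necessarily the authors' exact code.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def init_from_imagenet_red_channel(model: nn.Module) -> None:
    """Copy ImageNet-pretrained weights, taking the first-layer filters
    from the red channel only."""
    pretrained = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    state = pretrained.state_dict()
    # conv1 pretrained weights have shape (64, 3, 7, 7); keep channel 0 (red).
    state["conv1.weight"] = state["conv1.weight"][:, :1]
    # Drop the 1000-class ImageNet head; our fc layer keeps its own init.
    del state["fc.weight"], state["fc.bias"]
    model.load_state_dict(state, strict=False)
```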
Dataset finetuning. In order to estimate a best-case scenario in terms of performance for each dataset, M Mixed was then retrained on each of the other single-source datasets (BIMCV, COVIDGR and CHVNGE). By having access to data from each dataset, dataset-specific features can be learned, improving performance. Naturally, finetuned models will also be more subject to dataset bias and to learning shortcuts that misrepresent COVID-19 manifestations. These models are hereinafter referred to as M BIMCV , M COVIDGR and M CHVNGE , respectively.
Pseudo-labelling. Given the characteristics of the Mixed dataset and the high correlation between dataset sources and classes, it is expected that the M Mixed model learns not only radiological features of COVID-19 but also relies heavily on learning shortcuts 17 . In order to decrease this effect and improve the learning of radiological features, a pseudo-labelling strategy 29 was implemented. This was performed by obtaining the predictions of the model trained on the Mixed dataset on all images of the same dataset. All images with predicted probability lower than 0.9 were then removed, to include only high confidence predictions, and the predicted class was then used as the ground truth label for retraining the model. This model is referred to as M MPseudo .
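A sketch of the confidence filtering step follows; the loader interface (yielding sample-id and image batches) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def pseudo_label(model, loader, threshold=0.9):
    """Return (sample_id, predicted_class) pairs for confident predictions."""
    model.eval()
    kept = []
    for ids, images in loader:  # ids: LongTensor, images: (B, 1, 512, 512)
        probs = torch.softmax(model(images), dim=1)
        conf, pred = probs.max(dim=1)
        mask = conf >= threshold  # discard predictions below 0.9 confidence
        kept += list(zip(ids[mask].tolist(), pred[mask].tolist()))
    return kept  # predicted classes become ground truth for retraining
```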
Radiologist annotations. Given that not all COVID-19 positive cases present manifestations on CXR, considering these cases as COVID-19 during training can introduce high levels of noise and promote the learning of shortcuts for classification which do not represent COVID-19 manifestations. Using radiologist annotations during training avoids this issue, as COVID-19 positive CXRs without manifestations will be presented to the model as Normal CXRs during training, enforcing the learning of features that represent COVID-19 manifestations. For this purpose, M Mixed was retrained using the labels given by the radiologists during manual annotation of the Mixed dataset. Regarding the CXRs annotated as Undetermined, the principle of precaution was applied: given the definition of this class as CXRs where the patient presented findings that could be indicative of COVID-19 but could also be indicative of another condition, these CXRs were considered to belong to the COVID-19 class. CXRs marked as Compromised were discarded. This model is hereinafter referred to as M MAnnot .
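The label mapping described above can be made explicit as follows (a sketch; the identifiers are hypothetical):

```python
# Map radiologist annotations to training labels, as described above.
LABEL_MAP = {
    "Normal": "Normal",
    "Not indicative of COVID-19 (pathological)": "Pathological",
    "Indicative of COVID-19": "COVID-19",
    "Undetermined": "COVID-19",  # principle of precaution
    "Compromised": None,         # discarded from training
}

annotations = [("img_001", "Undetermined"), ("img_002", "Compromised")]
training_set = [(img, LABEL_MAP[a]) for img, a in annotations if LABEL_MAP[a] is not None]
```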
Performance evaluation. The agreement between radiologist annotations and ground truth labels in the detection of COVID-19 was evaluated in terms of precision (Prec.) and recall, whereas the intra- and interobserver variability were evaluated in terms of accuracy (Acc.) and Cohen's kappa (κ) 30 :

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where p_o is the observed agreement between readers and p_e the agreement expected by chance. Model performance was evaluated with the receiver operating characteristic (ROC) curve, specifically in terms of the area under the curve (AUC). Confidence intervals were computed taking into account the average performance achieved across all folds. To further validate the models, Grad-CAM++ 32 was used to visualize the location of the regions responsible for the network predictions where relevant. Finally, the calibration of the model's predictions was assessed using the Expected Calibration Error (ECE) 33 . The ECE is a summary of the reliability diagram, which plots the expected accuracy as a function of the predicted class probability. Briefly, all N test samples are binned probability-wise into B groups, for which the accuracy Acc_b and average confidence p_b for the corresponding reference label are computed. ECE is the weighted average of the difference between these two measures:

$$ECE = \sum_{b=1}^{B} \frac{N_b}{N} \left| Acc_b - p_b \right|,$$

where N_b is the number of samples in bin b. Results are evaluated considering 10 bins.
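A minimal NumPy implementation of the ECE as defined above, with equal-width bins (binning conventions at the bin edges are an assumption):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width probability bins (B = n_bins).

    confidences: predicted probability of the predicted class, one per sample.
    correct: 1 where the prediction matches the reference label, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi) if lo > 0 else (confidences <= hi)
        if mask.any():
            acc_b = correct[mask].mean()        # Acc_b
            conf_b = confidences[mask].mean()   # average confidence p_b
            ece += mask.mean() * abs(acc_b - conf_b)  # weight N_b / N
    return ece
```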
The statistical significance of the differences in the performance of the models (and the radiologists) was assessed using the DeLong test, which allows for paired comparison of AUCs 34 . When comparing AUCs across different test sets, the permutation test for continuous unpaired comparison of AUCs proposed in Venkatraman et al. 35 was used. Note that when comparing a model to radiologist annotations, only the subset of CXRs annotated by radiologists was taken into account. Statistical significances computed across folds were fused according to Fisher's combined probability test 36 to obtain a single p-value and, when performing multiple comparisons, statistical significance was considered after applying the Bonferroni correction 37 .
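The fold-fusion and correction steps can be sketched with SciPy; the p-values and the number of comparisons below are illustrative only.

```python
from scipy.stats import combine_pvalues

# Per-fold p-values for one model comparison (illustrative values only).
fold_pvalues = [0.04, 0.01, 0.20, 0.03, 0.08]

# Fisher's method: -2 * sum(ln p_i) follows a chi-squared distribution
# with 2k degrees of freedom under the null hypothesis.
stat, p_combined = combine_pvalues(fold_pvalues, method="fisher")

# Bonferroni correction for m simultaneous comparisons (m = 6 is hypothetical).
m = 6
is_significant = p_combined < 0.05 / m
```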

Results
CXR annotation. Figure 3 shows the confusion matrices between ground truth labels and radiologist annotations in each of the datasets, as well as interobserver and intraobserver variability. The performance of radiologists in COVID-19 detection in terms of precision and recall is shown in Table 3. As expected, considering as positives only CXRs marked as Indicative of COVID-19 (C) gives a higher average precision but with low recall, whereas including as positives CXRs marked as Undetermined (C + U) significantly increases recall, but at the expense of precision. Comparing across datasets, it can be seen that radiologists achieve the highest precision and recall on the COVIDGR and Mixed datasets, whereas the lowest performance is obtained for the CHVNGE dataset. Table 4 shows the inter- and intraobserver variabilities for all annotated CXRs. A statistically significant difference (p = 0.0021) was found between the two observers when considering as positives only the CXRs marked as Indicative of COVID-19, whereas other comparisons could not be performed with power ≥ 0.8 due to the reduced sample size.
Automatic CXR COVID-19 detection. Figure 4 shows the ROC curves of all trained models on each dataset, as well as a comparison to radiologist annotations after consensus. Table 5 shows the AUC of each model on all datasets and Table 6 shows the statistical significance of differences in AUC between readers (trained models and radiologists) for each dataset. Table 7 shows the model calibration metric ECE for each dataset, model and fold.
On the test set, intradataset train-test scenarios obtained the best results on all datasets, i.e. when training includes CXRs from the dataset used in testing (dashed lines in Fig. 4). This was most evident on the Mixed, BIMCV and CHVNGE datasets, with the differences in AUC between M Mixed and other models on the Mixed dataset, between M BIMCV and other models on the BIMCV dataset and between M CHVNGE and other models on the CHVNGE dataset all being statistically significant with p < 0.0001. On COVIDGR, differences in ROCs were less stark: the differences between M COVIDGR and M Mixed were statistically significant (p < 0.0015), whereas differences between M COVIDGR and the other models were less significant. On the subset of annotated CXRs, trained models outperformed radiologists in intradataset train-test scenarios, particularly on the Mixed and BIMCV datasets. Differences in performance between radiologists and each of the models yielded statistically significant differences (p < 0.0001) for all models on the Mixed dataset and for M BIMCV on the BIMCV dataset. On COVIDGR and CHVNGE, differences between radiologists and M COVIDGR and M CHVNGE were less significant (p = 0.0137 and p = 0.03691, respectively). In interdataset train-test scenarios, model performance is typically lower than or similar to that of radiologists. On CHVNGE, where model performance was lowest, only M MAnnot achieves a performance close to that of radiologists.

Real-world application. Figure 6 shows the performance of M MAnnot and the radiologist annotations for each CXR equipment used in acquisition at CHVNGE, which corresponds to the different hospital services as outlined in Section "Datasets". Table 8 shows the AUC for every cross-validation fold for each CXR equipment. M MAnnot was chosen for this analysis as it was the best performing method on CHVNGE, excluding the model finetuned on CHVNGE. Table 9 shows the statistical significance of differences in M MAnnot AUC between equipment types. Table 10 shows the statistical differences in ROC between M MAnnot and radiologists on the subset of annotated CXRs. It can be seen that performance is higher for CXRs acquired with Carestream and FUJI CR, followed by FUJI DX. The lowest AUC is obtained for CXRs acquired with Samsung, with a statistically significant difference to FUJI CR and Carestream (p < 0.0046). The performance of radiologists follows the same trend as M MAnnot , with the lowest sensitivity found for Samsung CXRs. Nevertheless, radiologists showed a significantly superior AUC for Samsung CXRs (p = 0.0398) and significantly inferior performance on FUJI CR CXRs (p = 0.0415).

Discussion
CXR annotation. The radiologists' recall when considering as positives CXRs marked as Indicative of COVID-19 and Undetermined (C + U) is in line with other studies 15 . On the other hand, the recall for Indicative of COVID-19 (C) alone is lower, suggesting that the Indicative of COVID-19 labelling protocol is overly conservative, as it only includes CXRs where radiologists were fairly certain that the patient presented COVID-19 infection. Regarding precision, the differences observed between datasets are likely related to the characteristics of each dataset. It can be seen that on both the Mixed and BIMCV datasets, false positive COVID-19 annotations mostly occur for pathological non-COVID-19 cases (Fig. 3a) and rarely occur for normal patients. As such, for datasets where images originate from multiple sources and represent a wider range of pathologies confoundable with COVID-19, precision is lower. While the Normal/Pathological distribution is not known for COVIDGR, 82% of non-COVID-19 cases were annotated as Normal by the radiologists, which indicates that the percentage of Pathological cases is significantly lower than in other datasets and is responsible for the high precision values obtained. Furthermore, this shows how dataset characteristics can bias the performance obtained not only by radiologists but also by models tested on these datasets.
Interestingly, the radiologists' performance on the CHVNGE dataset is lower than on the other datasets. This suggests that the CHVNGE dataset is more challenging and that the public datasets misrepresent the different COVID-19 stages in comparison to the clinical reality at CHVNGE. This is particularly true for the Mixed dataset, which is known to mostly include severe COVID-19 patients 23 . On the other hand, CHVNGE may present a higher prevalence of early stage COVID-19 cases, which have limited radiological manifestations, resulting in lower recall. This hypothesis is corroborated by Fig. 6, where the experts' performance on images acquired in inpatient services and the intensive care unit is higher than on images acquired during initial patient screening in the emergency department.
Regarding the inter- and intraobserver variability, high accuracy (≥ 0.81) was obtained, albeit with only moderate κ values, particularly for interobserver variability when considering C as positives. Analysing Fig. 3b, it can be seen that while CXRs not annotated as Indicative of COVID-19 are consistently labelled, CXRs annotated by at least one of the radiologists as Indicative of COVID-19 are much less consistent, with radiologist 1 being in general more conservative and attributing the label Undetermined to a high proportion of CXRs annotated as Indicative of COVID-19 by radiologist 2. This difference in agreement is, however, expected, since the Undetermined label corresponds to borderline cases, where the main radiological manifestations of COVID-19 are not obvious or complete. Consequently, decisions in these cases may vary more frequently. This lack of consistency is, however, less evident when considering C + U as positives, where a considerably higher value of κ is obtained.
Automatic CXR COVID-19 detection. The AUC differences found between inter- and intradataset train-test scenarios (Table 5) corroborate the findings of DeGrave et al. 17 . Even though experiments were designed with patient-wise stratified cross-validation, the performance of the models in intradataset train-test scenarios was always higher than when different datasets were used for training and testing. This suggests that the deep learning system is not relying exclusively on radiological features to perform image classification, and is instead partially overfitting to other acquisition details. This further highlights the need to carefully validate systems prior to announcing (near-)human performance. On the other hand, the studied finetuning approaches helped to mitigate this overfitting behaviour. Indeed, results suggest that revisiting cases where COVID-19 radiological manifestations are more evident helps the model converge to a feature representation that better encodes the radiological manifestations of the pathology. In fact, both M MPseudo and M MAnnot are able to outperform M Mixed on all external datasets (Table 5) without requiring additional training data. The difference in performance on CHVNGE is particularly significant for M MAnnot (p < 0.0001), where the annotations performed by the radiologists were used to finetune the system. As shown in Fig. 4, training with a selection of images from the Mixed dataset known to contain COVID-19 features allowed the model to approximate the performance of the human experts on the CHVNGE dataset without the need to introduce images from that dataset. The hypothesis that finetuning with good quality labels improves the system's reliability is further supported by the models' calibration performance (Table 7). Interestingly, the lowest ECE values are achieved when finetuning with the CHVNGE dataset (M CHVNGE) and when using the expert annotations (M MAnnot). As previously discussed, the CHVNGE dataset may better represent the clinical reality and thus have a higher diversity of COVID-19 radiological manifestations and progression stages, promoting a less binarized output of the model. Likewise, M MAnnot shows low ECE values for all datasets. This further corroborates that adjusting the model's weights using the CXRs annotated by the radiologists mitigates overfitting by redefining the solution space, dampening previously overconfident incorrect predictions.
Real-world application. The performance of both M MAnnot and the radiologists (Indicative of COVID-19 and Undetermined) is higher on images from intensive care units and inpatient services than on those from the emergency department (Fig. 6). As previously suggested, images from patients with late stage COVID-19 are expected to be easier to distinguish from other pathologies, because the radiological manifestations resulting from the infection are more visible. This further reinforces the need to contextualize the acquisition setting when reporting model performances. While it has been repeatedly suggested in the literature that a system such as the one proposed in this study could be used as an early screening tool, it is clear from this study that even M MAnnot , the best performing model on CHVNGE without dataset finetuning, has a much lower performance than what has been suggested in the literature and than what can be ascertained from available public datasets, where significantly higher AUCs have been reported (AUCs ≥ 0.9). Instead, systems such as this are perhaps put to better use as tools for evaluating the progression of severe COVID-19 infections, reducing the workload of intensivists and radiologists in intensive care units by providing an objective opinion on the disease's progression.

Main findings and limitations.
As discussed in the Introduction section, several methods for automatic COVID-19 diagnosis in CXR have been proposed in the literature. Particularly at the beginning of the pandemic, extremely high performances were reported (see Fig. 1) 9 . These results were replicated in this study in intradataset train-test scenarios (e.g. M Mixed tested on the Mixed dataset). However, performance in interdataset train-test scenarios was found to be much lower, likely due to significant dataset bias. In these scenarios, most trained models could not achieve the performance of radiologists, particularly on the CHVNGE dataset, which more closely represents clinical reality. When finetuned with radiologist annotations, however, M MAnnot showed a more consistent performance across datasets and a much closer ROC to that of radiologists on CHVNGE. Finally, performance on different hospital services (through the CXR equipment used) was studied, showing that the performance of an automatic system for COVID-19 detection in CXR is nevertheless underwhelming and that research efforts should instead be directed towards, for example, the evaluation of disease progression and severity in inpatient/intensive care units.
In spite of the promising results obtained in this study, there are limitations that must be taken into account in the interpretation of results and future directions. As highlighted in this manuscript, the training of deep learning systems relies heavily on the available data and, while this study includes data from several sources to achieve a good representation of multiple environments, the dataset intended to represent clinical reality is limited. For one, it represents the reality of a single hospital system in Portugal, which may limit the reproducibility of these results in other hospitals. Although we believe that the performance differences reported in Table 8 are meaningful and justifiable, it would still be of interest to corroborate our findings on additional data sources. Furthermore, the data from CHVNGE represents a limited scope in time, which is particularly relevant given the rapid changes of COVID-19 since its appearance. As such, to properly evaluate the true clinical impact of this type of system, different time points would need to be considered. Finally, there is no guarantee that the achieved performance and model behaviour are reproducible for different network architectures. Indeed, in this study we opted for a ResNet architecture which, as highlighted in Fig. 1, accounts for approximately 25% of the methods proposed during the initial outbreak. However, different network architectures may have different generalization capabilities and robustness to overfitting. Given the impossibility of assessing all available architectures, we aim to raise awareness of the need to properly train and evaluate a model's performance and thus avoid overconfident claims.

Conclusion
This study assessed the performance of a deep learning system for COVID-19 screening using CXR and compared it with expert radiologists. The detection of COVID-19 in CXR images is non-trivial due to the wide range of radiological manifestations associated with the infection. Consequently, radiologists tend to confuse COVID-19 with other pathologies. Similarly to other recent studies, it was found that the performance reported for deep learning approaches is overconfident. Indeed, this study shows that screening performance is not robust to changes in data origin. However, the results suggest that finetuning the model with labels provided by human experts allows the network to improve the quality and meaningfulness of the extracted features, improving explainability and reducing data bias.
Results also suggest that the applicability of these systems for initial patient triage, when radiological manifestations of COVID-19 are minimal, is limited. However, when radiological manifestations of COVID-19 are present, these can be accurately detected and pinpointed by these tools. Although the achieved results are promising, there is still a need to understand how well these findings translate to other time points/variants of COVID-19 and different clinical realities. Based on this study, future directions in this field should also focus on the use of deep learning systems for tracking the evolution of mild to severe COVID-19 infections, providing a robust second opinion and thus contributing to mitigating the consequences of the pandemic.

Access to the datasets used in this study is described in Section "Datasets". However, the CHVNGE Ethical Committee determined that the data cannot be used beyond the purpose of the current study, and thus cannot be shared publicly with other institutions.